ABSTRACT

A new approach, micro benchmarks, has recently been developed.
Using this technique, we have analyzed the KSR1, and in particular
the "ALLCACHE" memory architecture and ring interconnection.
We have been able to elucidate many facets of memory performance.
The technique has enabled us to identify and characterize parts of
the memory design not described by Kendall Square Research. Our
results show that a miss in the local cache can incur a penalty
ranging from 1.5 microseconds to 500 microseconds (when a dirty
"page" in the local cache must be evicted). The programmer must be
very careful in placement and accessing of data to obtain maximum
performance from the KSR1; the data presented here will help in
understanding the performance actually obtained.

1.  Introduction 

The KSR1 from Kendall Square Research is a novel parallel
computer. It is the first commercial machine embodying a scalable
all cache form of shared memory architecture. In addition, there
are a number of other interesting features of the machine.

We report our observations of the KSR1, obtained by means of a
suite of small benchmarks that expose the details of the machine
characteristics. We refer to these small benchmarks as micro
benchmarks. In section 2 we briefly describe the micro benchmark
approach, and its application to parallel machines. The micro
benchmark suite has been developed and used to analyze the
performance of uniprocessor machines [12, 13]. We describe the
architecture of the KSR1 as we understand it in section 3. We have
run our standard micro benchmark suite for processor performance,
and included the results together with comparative results from two
other CPUs of interest. The main focus of our work has been to
understand and measure the performance of the KSR1’s novel
"ALLCACHE" memory, which we have extensively analyzed with a new
set of micro benchmarks. This work and the results are described in
section 5. Section 6 analyses a set of experiments used to measure
the effect of contention in the interconnection network.

2.  The  Micro  Benchmark  Approach 

Recently, one of us (Saavedra) has explored a new approach to
benchmark analysis of computers. This approach has been documented
in several papers [12, 13, 14]. The approach was developed in
reaction to the use of large applications as benchmarks. Though it
is hoped that large applications will be more representative of
real workloads than synthetic benchmarks or small kernels, it is
not clear what features of a particular system they exercise, or
what actually accounts for the differences in the performance of
these benchmarks on different machines.

The micro benchmark approach returns to the idea of measuring
specific features of the machine. But in contrast to measuring only
a few parameters, such as floating point multiply, the approach
consists of (1) measuring every observable feature of the machine,
and (2) making use of the collected set of data in an integrated
way. For uniprocessors, one of the most powerful ways of using the
micro benchmarks is to predict the performance of a program on a
new computer without first porting the program and measuring the
results. This is done by analyzing the program to determine how
much use is made of each of the machine features measured by the
micro benchmarks, and then using the results of the execution of
only the micro benchmark suite on the new computer to predict the
running time of the program. Using the approach, it has proven
possible to estimate accurately the performance of a wide range of
programs, including standard benchmark suites such as Spec and
Perfect [16, 3]. Further, the analysis of these programs in terms
of the features measured by the micro benchmark suite gives insight
into the reasons for the observed performance differences for the
programs on different computers.

Important factors in machine performance are cache, memory, and
network interconnect behavior. A test has been developed that
reveals a great deal about the memory hierarchy behavior (including
the network), together with a way of displaying the data from this
test using a set of diagrams that we have named Physical and
Performance Profiles (or P³ diagrams). We will explain this test,
and the main characteristics of the P³ diagrams, below. The data
thus obtained, together with information about the rate of misses
in a program, is factored into the other micro benchmark results as
part of the prediction methodology.

This paper reports results of the micro benchmark analysis of the
KSR1. The machine contains many features (described below) not
found in other machines. To understand the performance implications
of these features, it has been necessary to develop additional
micro benchmarks beyond the initial, general purpose suite. These
are described later.

An open and interesting question is the analysis of parallel
programs, and the prediction of their performance through the use
of micro benchmarks. This involves developing new tests for
synchronization, access to shared variables, and probably features
provided in run time libraries for parallel machines.
Figure 1: KSR1 cache and subcache organization.


However,  it  is  not  clear  at  this  point  which  factors  are  relevant  to 
take  into  account  in  building  a  reasonable  model  of  program  exe¬ 
cution  which  can  produce  predictions  about  the  execution  time  of 
parallel  programs.  Furthermore,  we  believe  that  a  substantial 
experimental  understanding  about  the  performance  regimes  exhi¬ 
bited  by  shared  memory  machines  is  required  before  any  such 
model  can  be  developed.  We  will  report  separately  our  results  for 
message  passing  machines,  and  for  other  shared  memory  machines 
such  as  the  Stanford  DASH  [8],  for  which  we  have  also  developed 
new  micro  benchmarks. 

As will be seen below, our approach reveals a great deal of
interesting information about the KSR1. There is much more work to
be done, however, to extend the approach into the area of
prediction.

3. Architecture of the KSR1

The KSR1 is a new architecture for parallel machines, and one that
attempts to solve the problem of scalability in shared memory
multicomputers. The problem is that as the number of processors
increases, the average cost of access to memory goes up. One
approach has been to design faster memory interconnects, together
with highly interleaved memory. This approach appears to be limited
to at most a few hundred processors.

An alternate approach, explored by both Kendall Square and the
Stanford DASH project, is to use a directory-based caching scheme
with a relatively large amount of memory close to each individual
processor. Both of these machines use a "message-passing"
interconnect mechanism rather than more typical memory interconnect
methods. In the case of the KSR1, it is a ring of rings [6], while
the Stanford DASH uses a 2 dimensional mesh interconnection [8].

The KSR1 is organized as a ring of rings, with up to thirty two
processors connected to each lowest level ring, called ring:0.
Ring:0 rings are interconnected by another ring, ring:1. Each node
consists of a CPU, a 512-KByte subcache, 32 MBytes of cache memory,
a cache directory, a ring interface, and an I/O interface. The
directory supports the cache coherency protocol between processors
and between rings. A total of 1088 processors can be connected as
34 ring:0 rings interconnected by one ring:1 ring. Kendall Square
Research intends to extend the hierarchy so as to connect even more
processors in future designs.

The processing component has four functional units: integer,
floating point, control, and an I/O unit. Instructions are issued
in pairs: an integer or floating point instruction paired with a
control or I/O instruction. The machine is a load/store
architecture, with loads and stores issued by the control unit.
Some floating point instructions result in two floating point
operations. There are 32 integer registers, 64 floating point
registers, and 32 addressing registers. The integer and floating
point registers hold 64 bits; the addressing registers are 40 bits.
The machine operates at 20 megahertz and is fully pipelined, with
two branch delay slots.

The user view of memory (called the Context Address - CA) is a
segmented address space. Segments can range from 2^22 to 2^40
bytes (in the current implementation) in length. The processor
produces 40 bit addresses, interpreted as a segment number and
offset. The processor contains an instruction segment table, with 8
entries, and a data segment table of 16 entries.

The addresses generated by the processor are translated into
System Virtual Address (SVA) space. The segment tables include the
base location of the segment in SVA, the segment’s length, and
access permissions. Presumably, the segment tables are fully
associative.

3.1. KSR1 ALLCACHE Memory Architecture

The memory in each processor is organized as a two-level cache
hierarchy. There is a large local cache that combines the functions
of memory and a second level cache, and a subcache that is the
first level cache. Both caches are managed by a directory structure
consisting of a large unit of allocation in the cache directory,
and a smaller unit of data transfer, as shown in Figure 1. This is
in contrast with more common cache organizations in which the units
of allocation and transfer are the same.

There are instruction and data subcaches, each 256K bytes. Each
subcache is managed by a subcache directory with 128 blocks each
containing 32 subblocks. The directory is organized as 64 two-way
associative sets with a random replacement policy. A reference to a
new subcache block incurs the overhead associated with invalidating
all the subblocks of the block being displaced. The subcache is
cache coherent with the local cache. The subblocks each contain 64
bytes of data. A read from the subcache is satisfied in 3 clock
cycles (load instructions have 2 delay slots).

There is a 32 megabyte local cache in each node. There is no
memory with a fixed address (hence the term "ALLCACHE"). Instead,
memory is managed through a directory structure that makes all the
cache memory in the machine visible and accessible to every
processor. A similar memory architecture has been described in [5],
where the term Cache Only Memory Architecture (COMA) is used.

The local cache is 16-way set associative, and is organized in 16K
byte "pages". Each page is divided into 128 "subpages" of 128
bytes. These subpages are the unit of memory coherence. The
directory contains an entry for each of the 2K pages that comprise
the memory. Each entry includes a tag and the state of each
subpage, of which the visible states are invalid, read-only,
exclusive, and atomic. Exclusive means that this is the only copy
in any processor’s local cache. Atomic is exclusive and locked.
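
The sizes quoted above for the subcache and the local cache can be cross-checked with a few lines of arithmetic. The following is a minimal sketch of our own, not part of the measurement suite; the constant and function names are hypothetical, and only the numbers come from the text.

```python
KB = 1024
MB = 1024 * KB

# Local cache figures quoted in the text.
LOCAL_CACHE_BYTES = 32 * MB
PAGE_BYTES = 16 * KB
SUBPAGE_BYTES = 128
LOCAL_WAYS = 16

# Figures quoted for one (instruction or data) subcache.
SUBCACHE_BYTES = 256 * KB
SUBBLOCK_BYTES = 64
SUBBLOCKS_PER_BLOCK = 32
SUBCACHE_WAYS = 2


def local_cache_geometry():
    """Derive page, subpage and set counts of the local cache."""
    pages = LOCAL_CACHE_BYTES // PAGE_BYTES
    return {
        "pages": pages,
        "subpages_per_page": PAGE_BYTES // SUBPAGE_BYTES,
        "sets": pages // LOCAL_WAYS,
    }


def subcache_geometry():
    """Derive block size, block count and set count of one subcache."""
    block_bytes = SUBBLOCK_BYTES * SUBBLOCKS_PER_BLOCK
    blocks = SUBCACHE_BYTES // block_bytes
    return {
        "block_bytes": block_bytes,
        "blocks": blocks,
        "sets": blocks // SUBCACHE_WAYS,
    }


if __name__ == "__main__":
    print(local_cache_geometry())
    print(subcache_geometry())
```

The derived values agree with the description: 2K pages of 128 subpages each in the local cache, and 128 blocks of 2K bytes arranged as 64 two-way sets in each subcache.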

Neither the whole 32 MBytes nor the 16-way associativity is
accessible from a user program. The OS sets aside a significant
number of pages (close to 30% of the total memory) for its own use
and these cannot be displaced out of the cache. We discuss in
detail the effective associativity experienced by user programs in
section 5.5.

A consequence of this design is that if the processor references a
subpage from a new page, all the lines from another page present in
the cache may need to be evicted. In order to keep at least one
copy of all pages currently being referenced, each page is assigned
a "home" node. The home page provides space for every subpage, even
if there are no valid subpages at the home node. Having a home node
greatly simplifies evicting a subpage, as there is guaranteed to be
a node with room for the subpage. If the home node must evict a
page, the operating system will swap the page to disk or to another
node’s local cache to make room for the new page. A significant
fraction of the node’s memory is set aside by the operating system
to be used as home pages [2].

The instruction set includes instructions to prefetch and poststore
subpages. A prefetch allows the processor to read a subpage from
another node without having to stall while the request is serviced.
There can be up to four prefetches outstanding at any time. If this
limit is reached, then additional prefetches are either discarded
or the processor is forced to block until one of the four pending
prefetches completes. A field in the prefetch instruction
determines the action to follow. An additional field indicates the
state of the subpage that is to be read: exclusive or read-only. A
poststore instruction causes the local cache to broadcast a
read-only copy of a subpage. All nodes with the subpage’s page
allocated in their cache will read it, provided the cache directory
is not busy.

A  multi-ring  machine  contains  a  ring  interface  in  each  ring:0  ring 
that  is  a  directory  for  the  entire  ring  (that  is,  it  contains  an  entry  for 
every  page  that  is  in  any  processor  cache  in  the  ring). 

The data rate of ring:0 is 1 gigabyte/second. Ring:1 supports a
"fat" structure with multiple rings to provide 1.2, 2.4, or 4.8
Gbytes/s bandwidth. Increasing the number of subrings in a ring:1
structure reduces the total number of available nodes that can be
used for processing in ring:0.

The KSR1 provides sequential consistency, which implies that
writes to a subpage cannot complete until all other copies present
in the machine have been invalidated [7].

3.2. Paging on the KSR1

The KSR1 scheme of allocating space in the caches in large,
page-sized units and filling in units of cache lines (subpages)
appears to be a reasonable compromise. If we consider the amount of
storage required for cache directory information we see that the
current implementation will require only about 100KB. This is
calculated from having 2048 page entries, each of which includes a
19-bit tag and at least 3 state bits for each of the 128 subpages.
A simpler implementation that separately allocated each subpage
would require more than 700KB of directory storage.
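
The two storage estimates can be reproduced directly. The sketch below is our own illustration with hypothetical names; it assumes exactly 3 state bits per subpage (the text says "at least 3") and, for the per-subpage alternative, that each subpage entry would carry the same 19-bit tag plus its own 3 state bits.

```python
PAGES = 2048             # page entries in the local cache directory
SUBPAGES_PER_PAGE = 128
TAG_BITS = 19
STATE_BITS = 3           # assumed: the minimum quoted in the text


def paged_directory_bytes():
    """One tag per page, plus state bits for each of its subpages."""
    bits_per_entry = TAG_BITS + STATE_BITS * SUBPAGES_PER_PAGE
    return PAGES * bits_per_entry // 8


def per_subpage_directory_bytes():
    """Assumed alternative: every subpage has its own tag and state."""
    bits_per_entry = TAG_BITS + STATE_BITS
    return PAGES * SUBPAGES_PER_PAGE * bits_per_entry // 8


if __name__ == "__main__":
    print(paged_directory_bytes() / 1024)        # about 100 KB
    print(per_subpage_directory_bytes() / 1024)  # about 704 KB
```

Under these assumptions the paged directory needs 103,168 bytes (about 100KB) and the per-subpage directory 720,896 bytes (704KB), consistent with the figures above.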

The cost of this method is that an entire page must sometimes be
evicted due to a subpage miss. There are 2^19 pages in SVA that
index to the same set of page positions (see Figure 1), and the
cache is 16 way set associative at the page level. So sequential
references to a set of only 17 pages can (in the worst pathological
case) cause a page eviction at every reference.
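
The pathological case can be illustrated with a small simulation of a single 16-way set. This is a sketch under a stated assumption: it uses LRU replacement, which is our choice for illustration only, since the replacement policy of the local cache is not given here. The function name is hypothetical.

```python
from collections import OrderedDict


def count_page_misses(num_pages, ways, passes):
    """Cycle through num_pages pages that all index to one set.

    The set holds `ways` pages and uses LRU replacement (an assumption
    made for this illustration). Returns the total number of misses.
    """
    resident = OrderedDict()  # page -> True, ordered from LRU to MRU
    misses = 0
    for _ in range(passes):
        for page in range(num_pages):
            if page in resident:
                resident.move_to_end(page)        # mark most recently used
            else:
                misses += 1
                if len(resident) == ways:
                    resident.popitem(last=False)  # evict least recently used
                resident[page] = True
    return misses


if __name__ == "__main__":
    print(count_page_misses(16, 16, 10))  # only the 16 cold misses
    print(count_page_misses(17, 16, 10))  # every one of the 170 references misses
```

With 16 pages the set never overflows after the first pass; adding a seventeenth page makes every reference miss, and every miss after the first sixteen forces a page eviction.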

An advantage of the paged scheme is that it matches disk accesses
well. When accessing a disk, a system wants to get big chunks from
the disk since it’s slow. With local cache pages, KSR1 can quickly
clear a page’s state to make room, especially since the disk page
is the same size. Without cache pages a system would have to clear
each block individually. This is likely to be done serially, since
the tags would be in sequential RAM locations.

The KSR1 implementation saves a significant amount of storage per
node. This also greatly affects the ring interface node, which must
duplicate the entire state information of all 32 nodes on a ring:0.
This is an important consideration when looking at larger
configurations of the KSR1, as the ring directory must not become a
bottleneck if the system is to be scalable. More important than the
amount of storage in the ring directory is the time and cost to
search it: it requires checking each of the 512 (32 caches * 16-way
associativity) entries which might have a copy of the referenced
page.

3.3.  Comments  on  COMA  and  NUMA 

The COMA organization of the KSR1 differs from a more traditional
directory-based NUMA (non-uniform memory access) machine such as
the Stanford DASH in that it treats all of its main memory as a
cache. NUMA machines treat a memory address as having a static,
known location. The distance between a processor and the various
main memory modules differs, leading to the non-uniform access
distances. Both architectures commonly use the cache block as the
coherency unit (i.e. sharing occurs on a block basis). Note that
the KSR1 and Stanford DASH both include caches below the level of
main memory to improve performance. The DASH has first and
second-level caches, while the KSR1 has its subcache.

One important difference in the architectures occurs when accessing
shared data. When a NUMA machine misses in its cache for
consistency reasons, it will initiate a transaction to a node which
is functioning as the "home" for this address. However, a COMA
machine cannot direct its transaction to a specific location;
instead it issues a request that will search the caches of the
system until it finds the requested data. Determining which
organization will be faster for a particular access depends upon
the relative locations of shared data and the details of the
coherency protocol.
Factor              KSR1          Alpha         DASH
Integer Add         44.0 ns       6.1 ns        33.9 ns
Integer Multiply    925 ns        98.1 ns       3605 ns
Integer Divide      4968.8 ns     198.9 ns      10233 ns
F-Point Add         525 ns        22.9 ns       883 ns
F-Point Multiply    22.1 ns       20.0 ns       1473 ns
F-Point Divide      1760.6 ns     171.9 ns      702.4 ns
Complex Arith.      3199.1 ns     182.7 ns      7803 ns
Intrinsic Func.     7969.6 ns     11343 ns      2683.0 ns
Logical Ops.        1485 ns       27.2 ns       106.7 ns
Branch/Switch       148.2 ns      15.0 ns       36.4 ns
Proc. Calls         757.2 ns      62.0 ns       379.6 ns
Array Indexing      473 ns        39.7 ns       1114 ns
Loop Overhead       101.6 ns      195 ns        105.6 ns

Table 1: Single CPU Performance of the KSR1, DEC Alpha
4000/610, and DASH.

4. Summary of KSR1 CPU Micro Benchmark Measurements

As mentioned above, a micro benchmark suite that measures CPU
performance has previously been developed and used to obtain
measurements on a variety of uniprocessor machines. Results for
many systems have been reported in [12, 13]. We have run the same
suite on the KSR1 CPU.

The CPU micro benchmarks are machine-independent, so instead of
measuring machine instructions they measure operations defined in a
high level abstract machine. The abstract machine is based on the
Fortran programming language, so applications written in Fortran
compile directly into the abstract machine code. The number and
type of operations is directly related to the kind of language
constructs present in Fortran. Most of these are associated with
arithmetic operations and trigonometric functions. In addition,
there are parameters for procedure call, array index calculation,
logical operations, branches, and do loops.

In Table 1 we present the results we obtained, together with the
results of running the same micro benchmarks on the Stanford DASH
and a DEC Alpha 4000 model 610 system running at 160 MHz. The
processor in the Stanford DASH is the MIPS R3000 running at 33 MHz.
In future work, we plan to use these benchmarks and provide a
comparison of the KSR1 with a number of other parallel machines.

5. Analysis of the KSR1 Memory Architecture

As discussed above, a general methodology for analyzing the memory
behavior of machines with caches has been described in [13]. We
have extended this methodology with additional benchmarks that
measure the memory hierarchy behavior of shared memory
multiprocessors. Here, we give a brief explanation of the approach,
and present the results we have obtained for the KSR1.

5.1.  Methodology 

There are many specific measurements one can make of a memory
system. In general, there may be several levels of cache in
addition to the main memory in the system. The main memory may be a
single global module or distributed among the nodes. If the memory
is distributed, the processors may treat each module as local
memory, or may share the memory of all modules in a global, shared
address space. The properties of the memory interconnect, including
bandwidth and latency under a variety of loads, are of interest.
There may also be a write buffer associated with each level of
cache. There may be separate cache coherency directories as well as
the cache itself. It is a challenge to simply measure the
performance of all of these mechanisms in a way that shows what
happens under a variety of conditions. It is clear that a few
simple numbers are far from sufficient to characterize memory
behavior.

Beyond the issue of obtaining measurements that characterize the
behavior of the memory architecture under a full range of
conditions, there is the problem of presenting the results in a
more meaningful form than a large table of measurements, or
reducing the results to a few average numbers. Most useful would be
a presentation of the results that allows a programmer with a
specific application to understand what the memory performance of
his program would be. We have developed a method of displaying the
results that captures a significant amount of information in
graphical form. We call these diagrams Physical and Performance
Profiles (P³ diagrams) of the memory subsystem, as they contain the
physical characteristics of each memory structure in addition to
the performance characteristics.

5.2. The Structure of the Physical and Performance Profiles

Because of the complexity of the memory architectures of interest,
the P³ diagrams require some effort at interpretation, but they
have a great advantage as compared with a set of tables of results,
and provide far more information than averages which summarize the
measurements. The P³ diagrams are a set of plots representing the
average execution time needed to read, modify and write (an R-M-W
cycle) a single element in a sequence of locations (not necessarily
contiguous) taken from a region of memory, as a function of the
size of the region (R) and the distance (stride S) between
consecutive elements. An alternative experiment consists of reading
elements without changing their values. We refer to this type of
experiment as a read-use cycle (R-U cycle).

The access times are measured by timing the execution of a Fortran
loop. Each data point on a curve is the mean time per iteration
calculated from performing a fixed number of accesses to an array
of the given size, using that stride. The clock resolution of the
machine is 20 μsec. By factoring out loop overhead and averaging
over a large number of iterations, we believe that the error in our
results is generally less than a clock cycle.
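
The bookkeeping of one measurement can be sketched as follows. This is a hypothetical illustration in Python rather than the Fortran loop actually used; the function names are ours, and interpreter overhead in Python will largely mask cache effects, so the sketch shows only how a fixed number of strided accesses is timed and how loop overhead is subtracted.

```python
import time


def rmw_loop(data, stride, accesses):
    """Perform a fixed number of read-modify-write accesses at a stride."""
    n = len(data)
    i = 0
    for _ in range(accesses):
        data[i] = data[i] + 1.0   # read, modify, write one element
        i += stride
        if i >= n:                # wrap around: start another pass
            i = 0


def overhead_loop(n, stride, accesses):
    """Same index arithmetic as rmw_loop, without touching the array."""
    i = 0
    for _ in range(accesses):
        i += stride
        if i >= n:
            i = 0


def mean_time_per_access(region_elems, stride, accesses=200_000):
    """Mean seconds per R-M-W access with loop overhead factored out."""
    data = [0.0] * region_elems

    start = time.perf_counter()
    rmw_loop(data, stride, accesses)
    with_access = time.perf_counter() - start

    start = time.perf_counter()
    overhead_loop(region_elems, stride, accesses)
    loop_only = time.perf_counter() - start

    return (with_access - loop_only) / accesses


if __name__ == "__main__":
    for stride in (1, 2, 4, 8):
        print(stride, mean_time_per_access(1 << 16, stride))
```

Using a fixed number of accesses per data point, rather than a fixed number of passes, keeps the averaging window the same for every stride.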

Depending on the relative magnitudes of R and S of an R-M-W
experiment with respect to the size, width, and associativity of
the structures forming the memory hierarchy, a distinctive value
for the average execution time is obtained. All results are
depicted as a set of curves, where each curve corresponds to a
particular value of R, with all values of S = 2^n from 1 to R/2
plotted. In this section we briefly explain how to read these
diagrams. A more extensive discussion can be found in [14].

For explanatory purposes, the discussion will focus on the effects
of our experiments on a memory hierarchy consisting of a single
cache. The explanation extends trivially to more complex
hierarchies and in what follows we provide some comments in this
respect. Depending on the values of R and S with respect to the
size of the cache C, the line (block) size b and associativity a,
we can observe one of four basic regimes. Furthermore, the response
of a more complex memory hierarchy is just the superposition of the
memory structures’ individual responses, which always fit one of
the four basic regimes.

In the presentation we assume that all variables take values that
are powers of two. However, if one of the physical dimensions of a
memory structure happens not to be a power of two, it will be
necessary to use different sequences of values for R and S.


Each micro benchmark consists of making multiple passes over an
array of size R, accessing every S-th element. The first pass (at
stride 1) over region R will incur some cold misses, but this error
is negligible due to the length of the micro benchmark. Micro
benchmarks at larger strides will incur no cold misses, as they
touch a smaller set of elements.

The simplest regime (regime 1) occurs when R ≤ C. Here,
independently of the stride, all elements accessed by the
experiments fit in the cache, so there are no cache misses in the
steady-state phase of the experiment. Therefore, the average time
of the R-M-W cycle as a function of S is a constant line. Curve
128K in Figure 2 is a clear example of regime 1.

When R > C, misses start to occur, and depending on S we can
observe one of three regimes (2.a, b, c). Regime 2.a occurs when
S < b. Here, there are several consecutive accesses to each cache
line in between corresponding misses, so the cache miss penalty is
amortized amongst the accesses. As S grows, the average time for
the R-M-W cycle increases in proportion to S. All curves in
Figure 2 where R ≥ 512K and S < 64 correspond to regime 2.a.

Regime 2.b represents the situation where each reference falls into
a different cache line and it always generates a miss. Formally,
this is true only if the cache replacement policy is either FIFO or
LRU. For a random replacement policy, the effect rapidly converges
to that of LRU and FIFO as the number of lines mapping to a set
increases above the degree of associativity. Regime 2.b occurs when
b ≤ S < R/a. Here, each experiment touches a subset of all cache
sets, but the number of cache lines mapping to a set is greater
than the associativity. This result follows from the following
argument. There are C/(ab) sets in a cache. In general, an
experiment touches R/(b⌈S/b⌉) cache lines which are mapped into
C/(ab⌈S/b⌉) sets if S ≤ C/a or into a single set if S > C/a. In
regime 2.b, S ≥ b, so S/b is always a whole number greater than or
equal to one. Therefore, the number of lines touched is R/S and
these are mapped into either C/(aS) sets or a single one. In both
cases, each set receives Ra/C or R/S lines, respectively, and it
follows from condition R > C that Ra/C > a and R/S > a.

* Figure 2 represents the superposition of the effects of two
memory structures: the subcache subblock and block organizations.
All four regimes, however, are clearly identifiable in the figure
and we make reference to it to illustrate the regimes.

Therefore, in regime 2.b, the average time for the R-M-W cycle as a
function of S is constant, assuming there are no other effects
produced by the other memory structures. In Figure 2, regime 2.b
corresponds to the two plateaus present in all curves in the
regions R ≥ 512K and 64 ≤ S < 256, and R ≥ 512K and
2K ≤ S < R/4. The last regime (2.c) occurs when the number of
different cache lines mapping into the same set is less than or
equal to the set associativity. This situation is characterized by
condition R/a ≤ S < R. For this regime, the R-M-W cycle average
time drops to the level of regime 1. Furthermore, the ratio R/S at
which the drop occurs gives the set associativity of the memory
structure. In Figure 2 all curves where R ≥ 512K exhibit this
behavior at their rightmost point, indicating that the set
associativity is two.
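
The four basic regimes can be reproduced with a small simulation of a single set-associative cache. The sketch below is our own illustration, not the measurement code: it assumes LRU replacement (the subcache itself uses random replacement), a hypothetical toy cache with C = 4096 bytes, b = 64 bytes and a = 2, and it reports the miss ratio of the last of several passes as a stand-in for the steady-state access time.

```python
from collections import OrderedDict


def steady_state_miss_ratio(R, S, C, b, a, passes=4):
    """Miss ratio of the final pass over a region of R bytes at stride S.

    Models one cache of C bytes with b-byte lines and a-way LRU sets.
    All sizes are in bytes and assumed to be powers of two.
    """
    num_sets = C // (a * b)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    accesses = 0
    for p in range(passes):
        last_pass = (p == passes - 1)
        for addr in range(0, R, S):
            line = addr // b
            lru = sets[line % num_sets]
            hit = line in lru
            if hit:
                lru.move_to_end(line)        # refresh LRU position
            else:
                if len(lru) == a:
                    lru.popitem(last=False)  # evict least recently used
                lru[line] = True
            if last_pass:
                accesses += 1
                misses += 0 if hit else 1
    return misses / accesses


if __name__ == "__main__":
    C, b, a = 4096, 64, 2   # hypothetical toy cache
    print(steady_state_miss_ratio(2048, 8, C, b, a))     # regime 1: no misses
    print(steady_state_miss_ratio(8192, 8, C, b, a))     # regime 2.a: S/b
    print(steady_state_miss_ratio(8192, 64, C, b, a))    # regime 2.b: all miss
    print(steady_state_miss_ratio(8192, 4096, C, b, a))  # regime 2.c: no misses
```

In this toy configuration the miss ratio grows as S/b in regime 2.a, stays at one throughout regime 2.b, and falls back to zero at S = R/a, which is how the ratio R/S at the drop reveals the associativity.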

In the next section, we discuss some specific performance
characteristics of the KSR1 that are observable from the diagrams
our experiments generate. In the diagrams we identify the KSR1
regimes by using roman numerals instead of the basic regime
numbers. We do this to avoid confusion, because most of the KSR1
regimes represent the superposition of several basic regimes
affecting different memory structures in the memory hierarchy.

Figure 3: KSR1 single node Read-Use cycle physical and performance
profile.

5.3. KSR1 Single Ring, Single Node Performance Results

In Figure 3 we show the performance of a single KSR1 node while
reading from the cache. The figure consists of curves for regions of
size R = 128KB to 8MB, with strides of 8 bytes, 16 bytes, ..., R/2
bytes. The KSR1 word size is 64 bits, so 8 bytes is the smallest
stride.

When the data set being accessed is smaller than the size of the
subcache (regime 1) there will be no cache misses and we can read the
base time per iteration. The flat curve for the 128KB data set in
Figure 3 shows this case, and we see that the average time per
iteration of the loop for all strides used was about 650 nanoseconds.
This is the time to perform one iteration of the loop with a floating
point add and multiply.

The size of the largest such curve with no misses tells us the size
of the subcache. In this case we see that the subcache is 256KB. The
line is not completely flat due to interference from other data used
by the process: the data set is the same size as the cache, and any
accesses to other data will cause cache misses.

The 512KB and larger curves show us what happens in regimes 2.a,
2.b, and 2.c. The data is initially not in the subcache and the first
reference to a subblock will cause a cache miss; succeeding references
to the same subblock will hit. At stride 8, there will be one miss and
7 hits. At stride 16, there will be one miss and only 3 hits to each
subblock. As we increase the stride, we decrease the number of hits
and the cost of the miss is amortized over fewer accesses. At a stride
of 64 bytes, the curve flattens out, as every reference is made to a
different subblock. This indicates the transition from regime 2.a to
2.b and we are able to conclude that the subblock size for the
subcache is 64 bytes.

We can also read the time taken to satisfy a miss to a subblock by
measuring the difference in times between the case with a miss on
every access (a data set of 512KB and stride of 64) and the case with
no misses (a data set of 128KB). We see on the KSR1 that this is
approximately 1300 nanoseconds.
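
The amortization of the subblock miss over the references that share a subblock can be illustrated with a short Python sketch. It is a simplified model, not the measurement code: it uses the 650 ns base time, the 1300 ns subblock miss penalty and the 64-byte subblock quoted above, it deliberately ignores subcache block misses, and the function names are hypothetical.

```python
BASE_NS = 650            # time per iteration with no misses (from the text)
SUBBLOCK_MISS_NS = 1300  # read miss penalty for one subblock (from the text)
SUBBLOCK_BYTES = 64      # subblock size deduced from Figure 3

def refs_per_subblock(stride):
    """References that fall in the same subblock for a given stride."""
    return max(1, SUBBLOCK_BYTES // stride)

def avg_read_time_ns(stride):
    """Average time per iteration: one miss amortized over the references."""
    return BASE_NS + SUBBLOCK_MISS_NS / refs_per_subblock(stride)

for stride in (8, 16, 32, 64):
    hits = refs_per_subblock(stride) - 1
    print(stride, hits, avg_read_time_ns(stride))
```

At stride 8 the sketch gives one miss and 7 hits per subblock, and at stride 64 every reference pays the full penalty, matching the flattening of the curve.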

Between strides 64 and 2048 the curves repeat the same pattern of
rising access times. This regime shows the effect of accessing a new
block in the subcache. There is a second major inflection in the curve
at a stride of 2048 bytes; this corresponds to the case in which every
reference is to a new block of the subcache. From these curves, we are
able to deduce that there is a directory structure with blocks and
subblocks which manages the subcache (which we call the subcache
directory; Kendall Square Research has not published any information
about this aspect of the architecture).

At a stride of 256K bytes, the 512K byte curve shows that the cost of
a read is the same as the cost when there are no subcache misses. In
contrast, if the stride is 128K bytes, the cost of a read includes the
cost of a subcache page miss. From this, we can conclude that the
subcache is two-way set associative because at a stride of 256K bytes
only 2 different subblocks are being accessed and they map to a single
set in the cache.

Figure 4: Physical and performance profiles of the KSR1 subcache and cache structures obtained using 2 competing nodes running
R-M-W based experiments. Each regime identifies the mean execution time for a particular combination of subcache and cache
miss penalties. For array sizes smaller than 16MB, there is no contention between the nodes.


Figure 2 shows the case of a loop that reads a word, modifies it, and
writes the result back to memory. Compared with the read case shown in
Figure 3, there are extra costs because the blocks of the subcache
have been modified. This increases the subcache subblock miss penalty
by about 560 nanoseconds (approximately 11 clock cycles), and the cost
of a subcache block miss increases by about 300 nanoseconds (6 clock
cycles). We note that the base case is 650 nanoseconds, as it was for
the read case. From this we conclude that the write is completely
overlapped with the loop branch and other operations.


5.4.  Subcache  Random  Replacement  Policy 

In Figure 3 we can also observe that the subcache replacement policy
is random. This manifests itself in the height of curve 512K, which
reaches a lower height than the other curves in regime III. Regime III
corresponds to basic regime 2.b. This basic regime assumes either an
LRU or FIFO replacement discipline to enforce that every reference
generates a miss (both a subcache subblock miss and a block miss in
this case). With random replacement, however, some subset of the
references will not cause misses.


As mentioned in section 5.2, in a cache of size C and associativity
a, an experiment covering a region R will map Ra/C > a cache blocks
into the same set. Now, in a random discipline, the probability
P_surv that a cache block will remain in the cache after a pass
through all elements in the experiment is

$$P_{surv} = \left(\frac{a-1}{a}\right)^{Ra/C - 1} \qquad (1)$$

Consequently, the average execution time per R-M-W cycle in regime
III (T_III) should be:

$$T_{III} = T_{no\text{-}miss} + (1 - P_{surv})\, D_{miss} \qquad (2)$$

where T_no-miss and D_miss are, respectively, the average execution
time without misses and the miss delay penalty. Now, if we substitute
the KSR1 parameters in eqs. (1) and (2), with a = 2, C = 256K and
R = 512K, we get P_surv = (1/2)^3 = 1/8, so the effective subcache
miss delay penalty for curve 512K should be 7/8 = 0.875 of D_miss.
The results in Figure 3 for curve 512K exhibit an effective delay
penalty in the range .86 to .88.
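
The 7/8 figure can be reproduced with a few lines of Python. This is a sketch under the assumption that eq. (1) has the form P_surv = ((a-1)/a)^(Ra/C - 1), which is the reading that yields the 7/8 quoted in the text; the function names are hypothetical.

```python
K = 1024

def survival_probability(R, C, a):
    """Chance that a block survives one pass under random replacement.

    Each of the other (R*a/C - 1) blocks mapping to the same set evicts
    a randomly chosen way, sparing a given block with probability (a-1)/a.
    """
    blocks_per_set = R * a // C
    return ((a - 1) / a) ** (blocks_per_set - 1)

def effective_penalty_fraction(R, C, a):
    """Fraction of the full miss penalty paid on average, as in eq. (2)."""
    return 1.0 - survival_probability(R, C, a)

def regime_iii_time(t_no_miss, d_miss, R, C, a):
    """Average time per cycle: base time plus the expected miss penalty."""
    return t_no_miss + effective_penalty_fraction(R, C, a) * d_miss

# KSR1 subcache: C = 256K, a = 2; curve 512K.
print(effective_penalty_fraction(512 * K, 256 * K, 2))
```

For the 1M curve the same formula gives a fraction much closer to one, which is consistent with only the 512K curve being visibly lower in regime III.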

A more subtle manifestation of the random replacement policy in
Figures 2 and 3 is the decrease in the effective delay penalty for the
points S > C/a (128K) in curves 1M and higher. In all these points
there are R/S cache lines mapping to a single set, and as S increases
fewer lines are mapped to the set. A similar argument to that given
above applies, except that now the exponent is R/S - 1. We can see
that the second point from the right (identified by symbol §), which
corresponds to stride S = R/4, should also have an effective delay
penalty of 7/8. Figure 3 shows that the miss penalty drops to the same
value as for curve 512K.

The previous discussion illustrates how effective the diagrams are in
capturing the complex performance space exhibited by shared memory
machines.

5.5. KSR1 Single Ring, Two Nodes Performance Results

It is fairly expensive to use data that resides in another node on the
same ring in the KSR1. Figure 4 (see also Figures C-3 and C-4 shown on
the color plate page) shows a set of curves for read-modify-write (as
was the case for Figure 2), when the data sets are as large as or
larger than the local cache. The 32 MByte line is particularly
interesting. Not all the data can fit in the local cache at one time,
since some space is needed for the program, and perhaps for the
operating system or other pages for which the executing node is the
home cache. But clearly there is enough reuse of blocks that even at
large strides (8K to 2M), the average cost of the memory accesses is
somewhat less than the cost of large strides for larger data sets.

Figure 5: Physical and performance profiles of the KSR1 subcache and cache structures obtained using 2 competing nodes running
R-M-W based experiments. Each regime identifies the mean execution time for a particular combination of subcache and cache
miss penalties. For array sizes smaller than 16MB, there is no contention between the nodes.

As we mentioned in section 2, the KSR1 local cache is 32 MBytes and
16-way associative. However, some pages are "wired" by the operating
system, so they cannot be selected as victims. In addition, the local
cache acts as home of some fraction of the user's pages. Hence, the
effective cache size that a program sees is significantly less than
32 MBytes and the effective set associativity is less than 16.

Both of these characteristics are clearly present in Figure 4. For
example, the 32M curve, which in principle should fit in the cache,
clearly shows the presence of page misses with replacement. If the
entire 32 MBytes were available, the curve's shape should be the same
as that of the 16 MByte curve. The curve reaches its highest point
between strides 16K and 64K and then drops because the total number of
pages touched by a particular experiment is constant for all strides
less than 16K and then decreases in proportion to the stride.

With respect to the set associativity, the three rightmost points of
all curves having regions greater than 16 MBytes indicate that the
cache associativity is 8-way. We can see this by noting that when the
larger curves have a stride of 1/8 their size (e.g. 4MB stride in a
32MB data set), their access time drops back to the subcache block
miss level. This occurs because we are referencing only 8 different
local cache pages and they map to a single set in the local cache.
This value is less than the expected 16-way. Because our experiments
change R and S only in powers of two, we detect 8-way associativity,
while its real value can be any number between 8 and 16. We have
performed more detailed experiments and have found that the effective
associativity varies from set to set in the range from 3 to 12.
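
The relation between stride and the number of local cache pages touched can be sketched in Python as follows. The 16K page size is an assumption inferred from the statement that the page count is constant for strides below 16K, and the function name is hypothetical.

```python
K = 1024
M = 1024 * K
PAGE_BYTES = 16 * K  # assumed local cache page size, inferred from the text

def pages_touched(R, S, page_bytes=PAGE_BYTES):
    """Distinct pages referenced in one pass over region R with stride S."""
    # Below the page size every page is touched; above it, one page per reference.
    return R // max(S, page_bytes)

for stride in (8, 16 * K, 64 * K, 4 * M):
    print(stride, pages_touched(32 * M, stride))
```

With a 32MB region and a 4MB stride the sketch yields 8 pages, the case in which the measured time drops back to the subcache block miss level.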

Finally, from Figure 4, we observe that the cost of removing and
replacing a page in the cache is quite large (about 500 µsec). It
should be borne in mind that one or more blocks in each page being
replaced have been modified (are in the exclusive state), and must be
evicted before a new block of the new page can be retrieved. This is a
form of page swapping or thrashing between memories in the same ring,
and does not require that the pages being evicted be written to disk.

5.6. KSR1 Two Rings, Two Nodes Performance Results

So far, we have discussed the performance of a single node in
accessing data, though the data may reside on more than one node. We
now turn our attention to the case in which multiple nodes are writing
to the same set of data. Figure 5 (see also Figures C-1 and C-2 shown
on the color plate page) shows an experiment in which two nodes in the
same ring access data. This figure is like Figure 3, but with two
additional curves labeled "16M", for a 16 MByte data set. Two
processes on different nodes are simultaneously accessing the data set
in read-modify-write mode, for different strides. In one case, the two
nodes are on the same ring, and in the second case, they are on
different rings. The figure shows that the cost of a miss in the local
cache is about 7.6 µsec (at stride 128, the size of a subpage, every
read misses), if the subpage is in another cache on the same ring.
Note that the total cost of the access is the sum of all the misses,
about 10.1 µsec.

Summary of the KSR1 Micro Benchmark Results Using One and Two Nodes

        Subcache misses   Local cache misses          Evict   R-M-W cycle iteration time
Regime  subblock  block   subpage  page       Rings?  dirty   Total      Residual   Residual
                                                      page    time       time       miss
I       no        no      no       no         n.a.    n.a.    0.65 µs    -          baseline
II      yes       no      no       no         n.a.    n.a.    2.50 µs    1.85 µs    subblock
III     yes       yes     no       no         n.a.    n.a.    4.20 µs    1.70 µs    block
IV      yes       no      yes      no         same    n.a.    10.10 µs   7.60 µs    subpage
V       yes       yes     yes      no         same    no      11.80 µs   1.70 µs    block
VI      yes       yes     yes      yes        same    yes     520.00 µs  508.00 µs  page
VII     yes       yes     yes      no         diff    n.a.    28.60 µs   26.10 µs   subpage
VIII    yes       yes     yes      yes        diff    no      31.10 µs   2.50 µs    block

Table 2: Mean execution time for a single read-modify-write iteration. Each regime represents a combination of R and S producing a
particular pattern of misses to a subset of the memory hierarchy. Column "Evict dirty page" shows the delay involved in moving a
dirty page out of a cache after a cache miss. The dirty page has to be sent back to the "home" node.

The cost of accessing smaller data sets will be the same as shown in
the 16M curve, since the two processors will be invalidating each
other's cache subpages. The experiment was designed so that the
starting point for each processor was separated by 1/2 the size of the
data set (i.e., 8 megabytes apart). In this way, the processors are
not competing for the same data at the same time except at large
strides.

From the curve for the case of two processes located in nodes on
different rings, the subpage miss penalty is 27.8 µsec. The machine
used for the experiment had two ring:0 rings interconnected by a
ring:1 ring. We assume that there were only two ring interfaces in
ring:1. If traversal time in ring:0 is about 7.5 µsec for 33 nodes
(including the ring interface), then about 13 µsec are consumed in the
ring interfaces and in traversing ring:1. Since the per link data rate
in ring:1 is the same as in ring:0 for this machine, the cost of the
directory operations and ring insertions would appear to be consuming
most of this time.

The results displayed in Figures 4 and 5 show eight performance
regimes (indicated by roman numerals in the figures). Each regime is
determined by the number of misses it triggers in a number of
enclosing levels of the hierarchy. Different regimes also identify the
locations from which a miss can be satisfied.

Regime I (Figure 5) represents the baseline time: the case when no
misses are triggered by the micro benchmark. Regime II adds to the
baseline the delay due to subcache subblock misses, while regime III
includes both subblock and block miss delays. Regimes IV and VII
contain the effect of local cache subpage misses. The first captures
the case when the miss is satisfied by a node in the same ring, while
the second represents reading the data from a remote ring. In all
regimes, except VI, the region of data covered by the micro benchmarks
is less than the size of the local cache. Hence, all subpage misses
occurring in these regimes are only the result of mutual invalidations
between the nodes, because both nodes need exclusive rights over the
subpage.


[Figure 6 panel: read-modify-write average execution time, stacked as cache subpage miss, 7600 ns (152 cycles); subcache block miss (34 cycles); subcache subblock miss, 1850 ns (37 cycles); R-M-W baseline, 650 ns (13 cycles).]

Figure 6: Components of regime V (ring:0 latency).


In regime VI, on the other hand, satisfying a miss requires either
evicting a dirty page if the micro benchmark is based on the R-M-W
cycle or detecting that the page is not dirty and just dropping it if
it is based on the R-U cycle. In the former case the complete dirty
page has to be sent to the "home" node. In both situations there is a
significant extra penalty involved. The results just discussed are
summarized in Table 2.

Figure  6  shows  the  component  times  of  a  memory  access  in 
regime  V. 
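
The decomposition of regime V can be verified by adding the residual times listed in Table 2. The Python sketch below is illustrative only: the 50 ns cycle time is an assumption derived from the text quoting 650 ns as 13 cycles, and the names are hypothetical.

```python
CYCLE_NS = 50  # assumed clock period: 650 ns is quoted as 13 cycles

# Residual times from Table 2, in nanoseconds.
REGIME_V_COMPONENTS = [
    ("R-M-W baseline", 650),
    ("subcache subblock miss", 1850),
    ("subcache block miss", 1700),
    ("cache subpage miss (same ring)", 7600),
]

def total_ns(components):
    """Total iteration time as the sum of the individual components."""
    return sum(ns for _, ns in components)

def to_cycles(ns, cycle_ns=CYCLE_NS):
    """Convert a time in nanoseconds to clock cycles."""
    return round(ns / cycle_ns)

for name, ns in REGIME_V_COMPONENTS:
    print(f"{name:32s} {ns:6d} ns {to_cycles(ns):4d} cycles")
print("total", total_ns(REGIME_V_COMPONENTS), "ns")
```

The sum equals the 11.80 µs total time reported for regime V in Table 2.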

6. Communications Performance of the KSR1

The performance of the interconnection network has a significant
effect on the overall performance of parallel computations and greatly
affects the granularity achievable. The experiments reported in the
previous section, which measured the performance of the memory
hierarchy, were carefully designed to minimize the effects of loading
of the communications network. For real applications, both memory
performance and communications network performance will affect the
overall rate of computation. In this section, we report on our
experiments to investigate the performance of the KSR1 ring
interconnection network.

The experiments are similar to our memory experiments. A single shared
array is accessed by several nodes. The array is divided into equal
portions, with each portion accessed in a read-modify-write cycle by
only two nodes. The placement of the nodes and the assignment of
portions of the array to nodes is carefully designed so that all nodes
execute their portion of the experiment in about the same amount of
time. (This requires care; it is easy to construct experiments in
which the performance of some of the nodes is much worse than that of
other nodes, even though all are doing the same amount of work.)

[Figure 7: panels "RING-0 LATENCIES" and "RING-1 LATENCIES".]


6.1. Contention on a Single Ring: Analytic Model

Here we present a simple model for the extra delay in latency due to
contention in the ring in the case when all communication is local to
ring:0. We then compare the model against experimental results. In our
experiments, the time per iteration (t_iter) can be broken into two
components: time of computation (t_comp) and time for communication
(t_comm). Term t_comm has two additional components: the communication
time without contention (t_no-cont) and the extra delay due to
contention (t_penalty(n_nodes)). The latter term is a function of the
number of communicating nodes (n_nodes). Let n_slots be the number of
slots in ring:0. On the average, when a node wants to drop a packet
into the ring, there are (n_nodes - 1) t_no-cont / t_iter other
messages occupying slots. The probability that a random slot is empty
is given by

$$P_{empty} = 1 - \frac{(n_{nodes} - 1)\, t_{no\text{-}cont}}{n_{slots}\, t_{iter}}$$

Now, the number of consecutive occupied slots passing through a node
before an empty slot is found follows the geometric distribution with
parameter P_empty. Hence, the expected number of consecutive occupied
slots is (1 - P_empty)/P_empty (see footnote 2). Given that the time
between successive slots is t_slot, we can compute the expected extra
delay in latency due to contention as

$$t_{penalty}(n_{nodes}) = \frac{1 - P_{empty}}{P_{empty}}\, t_{slot} \qquad (3)$$

Eq. (3) does not apply to the case when nodes in different rings
communicate. Unfortunately, we do not have enough information about
ring:1 and the interfaces between ring:0 and ring:1 to produce a
realistic analytical model in this case.

Footnote 2: The mean number of slots passing through a node, including
the empty one, is given by (1 - P_empty)/P_empty + 1 = 1/P_empty.
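
The single-ring contention model can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: it assumes that the empty-slot probability is P_empty = 1 - (n_nodes - 1) t_no_cont / (n_slots t_iter) and that the penalty is the expected number of occupied slots times the slot time; all parameter names and the numeric values in the usage line are hypothetical.

```python
def p_empty(n_nodes, n_slots, t_no_cont, t_iter):
    """Probability that a random ring slot is empty (assumed model)."""
    occupied = (n_nodes - 1) * t_no_cont / t_iter  # mean number of other messages
    p = 1.0 - occupied / n_slots
    if not 0.0 < p <= 1.0:
        raise ValueError("model breaks down: the ring would be saturated")
    return p

def contention_penalty(n_nodes, n_slots, t_no_cont, t_iter, t_slot):
    """Expected extra latency: mean run of occupied slots times slot time.

    The run length before the first empty slot is geometric with
    parameter p, so its mean is (1 - p) / p.
    """
    p = p_empty(n_nodes, n_slots, t_no_cont, t_iter)
    return (1.0 - p) / p * t_slot

# Hypothetical values chosen only to exercise the formula.
print(contention_penalty(n_nodes=3, n_slots=4, t_no_cont=1.0, t_iter=1.0, t_slot=2.0))
```

Because t_iter itself grows with the penalty, a closer match to measurements would iterate the two expressions to a fixed point; the sketch keeps t_iter as an input for simplicity.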


6.2.  Experimental  Results 

The graphs in Figure 7 show the experimental values for the total
communication latency, that is, the contention-free communication time
plus the contention penalty, for various numbers of nodes. The figure
labeled "RING-0 LATENCIES" also shows the latency predicted by eq. (3)
in the single ring case.

The ring-0 results in Figure 7 clearly show that in the case of a
single ring, even though the latency tends to increase with the number
of nodes, this increase is relatively modest. In fact the total
increase in latency going from 2 to 32 nodes is less than 15% of the
original communication latency. There is a small error between the
analytical and experimental results which increases with the number of
nodes. The maximum error observed is less than 11% (see footnote 3).
This is because eq. (3) overestimates the contention penalty by
assuming that the time per iteration, t_iter, is independent of the
number of nodes in the experiment. In actuality the time per iteration
is given by

$$t_{iter} = t_{comp} + t_{no\text{-}cont} + t_{penalty}(n_{nodes})$$

where t_comp is the computation time, t_no-cont the communication time
without contention, and t_penalty the contention penalty. Considering
this new term in eq. (3) reduces the error between the experimental
and analytical results to less than approximately 5 percent.

Footnote 3: The 11% error in the contention penalty t_penalty(n_nodes)
represents only a 2% error in terms of the total communication
latency.

The contention penalty when node communication requires sending
messages through ring:1 shows a more interesting behavior. Figure 7.b
distinctly shows two performance regimes, one for less than 32 nodes
and another for more than 32 nodes. In the former case, increasing the
number of nodes by one increases latency, on the average, by 152 ns
(≈ 3 cycles), while in the latter case the increment is as large as
1000 ns (≈ 20 cycles). A simple model based on a linear fit of both
regimes gives the following formulas:

$$L(n) = \begin{cases} 152\,\text{ns} \times n + 25200\,\text{ns} & \text{for } n \le 32 \\ 1004\,\text{ns} \times (n - 32) + 27500\,\text{ns} & \text{for } n > 32 \end{cases} \qquad (4)$$

with respective correlation coefficients of 0.9713 and 0.9116. When
the number of active nodes increases from two to 32 and 60, eq. (4)
gives a relative increase of 21% and 110% respectively in the total
latency.
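
The piecewise linear fit of eq. (4) is easy to evaluate directly. The Python sketch below uses the coefficients as printed; assigning n = 32 to the first branch is an assumption, and the function name is hypothetical.

```python
def ring1_latency_ns(n):
    """Fitted inter-ring latency L(n) in nanoseconds for n active nodes."""
    if n < 2:
        raise ValueError("the fit covers experiments with at least two nodes")
    if n <= 32:  # assumed: the boundary point belongs to the first regime
        return 152 * n + 25200
    return 1004 * (n - 32) + 27500

for n in (2, 16, 32, 48, 60):
    print(n, ring1_latency_ns(n))
```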

From the data presented in Figure 7, we can estimate the data rate per
node under various communications loads. The conflict-free rate for a
node is about 17 megabytes/second (a 128 byte subpage every 7.6 µsec).
As the load gets heavy within a single ring, with the ratio between
requests for data (cache misses) and computation of our experiments,
the rate declines to about 15 megabytes/second when all 32 nodes are
generating requests. When data moves between rings, rates are much
lower. The range is from 5 megabytes/second for one node without
contention to about 4 megabytes/second when 32 nodes are active,
declining to less than 2.5 megabytes/second per node with a load of 60
nodes.
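
The per-node data rates follow from dividing the 128-byte subpage by the miss latency. A minimal Python sketch, with a hypothetical function name, taking megabyte to mean 10^6 bytes and using the 26.1 µsec inter-ring residual from Table 2 as an illustrative input:

```python
SUBPAGE_BYTES = 128  # subpage size given in the text

def rate_mb_per_s(latency_us, bytes_moved=SUBPAGE_BYTES):
    """Data rate in megabytes/second for one subpage per miss latency.

    Bytes per microsecond is numerically equal to 10^6 bytes per second.
    """
    return bytes_moved / latency_us

print(round(rate_mb_per_s(7.6), 1))   # same ring, no contention
print(round(rate_mb_per_s(26.1), 1))  # between rings, no contention
```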


7.  Related  Work 

Recently several other researchers have been investigating the
performance of the KSR1. Boyd et al [1] show a method of measuring
communications performance on multiprocessors using a synthetic
workload based on matrix multiplication of generated matrices. Other
researchers [10, 9] have also reported on experiments to measure the
performance effects of specific features of the KSR1. The
communication and synchronization performance of the KSR1 has been
analyzed by Dunigan [4]. Singh et al [15] present performance results
of several kernel codes and some of the SPLASH benchmark suite on the
KSR1 and DASH machines. Finally, an analytic model comparing the
potential benchmark performance of NUMA and COMA machines has been
developed by Hagersten [5].

8.  Conclusions 

Based on our measurements, it appears that the KSR1 ALLCACHE memory
architecture should gracefully extend to a large number of processors.
It indeed fulfills its promise of a scalable shared memory
architecture. As with any parallel machine, the performance of
parallel applications will depend on the degree and form of
interactions between computations running on separate nodes of the
machine. From our results, it is clear that there are types of shared
access that are expensive, and that the programmer should be aware of
the costs of accessing data, especially at larger strides, that reside
on a different node from the accessing node.

The overall performance of any machine is a combination of its many
features. The ALLCACHE memory is only one component of the KSR1. In
addition to the processor, the ring of rings interconnect is a major
element. The architecture is interesting and effective. Our results do
not permit us to give a relative evaluation of the machine in
comparison with other architectures, but our data will, we believe,
help potential users in deciding whether to use the machine, and how
to use it effectively.

Fig. C-1: KSR1 2-Node Physical and Performance Profile (Fig. 5).
The projection is taken from point {-2.5, -1.7, +0.5}.

Fig. C-2: KSR1 2-Node Physical and Performance Profile (Fig. 5).
The projection is taken from point {-2.5, +1.7, +0.5}.

Fig. C-3: KSR1 1-Node Physical and Performance Profile (Fig. 4).
The projection is taken from point {-2.5, -1.7, +0.5}.

Fig. C-4: KSR1 1-Node Physical and Performance Profile (Fig. 4).
The projection is taken from point {-2.5, +1.7, +0.5}.

Fig. C-5: KSR1 2-Node Physical and Performance Profile (Fig. 5).
Cache misses satisfied in local ring:0 are shown.

Fig. C-6: KSR1 1-Node Physical and Performance Profile (Fig. 4).
The effect of misses with page replacement is shown.


